
Han 翻译

Hierarchical Attention Networks for Document Classification

用于文本分类的分层注意力网络




0 Abstract

We propose a hierarchical attention network for document classification.

Our model has two distinctive characteristics:

  • (i) it has a hierarchical structure that mirrors the hierarchical structure of documents;

  • (ii) it has two levels of attention mechanisms applied at the word and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.

Experiments conducted on six large-scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin.

Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.

我们提出了一个用于文本分类的分层注意力网络。

我们的模型有两个显著的特征:
(1) 具有反映文档结构的层次结构
(2) 它有两层分别应用在单词和句子层面的注意力机制，使它在构建文档表示时能够对重要程度不同的内容给予不同程度的关注

在六个大型的文本分类任务上的实验表明 (我们)提出的结构在很大程度上比之前的方法好。

注意力层的可视化表明，该模型能够选出信息量大的单词和句子。

1 Introduction

paragraph 1

Text classification is one of the fundamental tasks in Natural Language Processing.

The goal is to assign labels to text.

It has broad applications including topic labeling (Wang and Manning, 2012), sentiment classification (Maas et al., 2011; Pang and Lee, 2008), and spam detection (Sahami et al., 1998).

Traditional approaches to text classification represent documents with sparse lexical features, such as n-grams, and then use a linear model or kernel methods on this representation (Wang and Manning, 2012; Joachims, 1998).

More recent approaches used deep learning, such as convolutional neural networks (Blunsom et al., 2014) and recurrent neural networks based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), to learn text representations.

文本分类是自然语言处理中的基础任务之一。

目的是为文本分配标签。

它具有广泛的应用,包括 主题标签(Wang和Manning,2012年) ,情感分析(Maas等人,2011年;Pang和Lee,2008年),和垃圾电邮检测(Sahami等,1998年)。

传统的文本分类方法用稀疏的词汇特征(例如 n-gram)来表示文档，然后在该表示上使用线性模型或核方法(Wang 和 Manning，2012；Joachims，1998)。

最近的方法使用深度学习,例如卷积神经网络(Blunsom等人,2014)和基于长短期记忆(LSTM)的递归神经网络(Hochreiter and Schmidhuber,1997)来学习文本表示。

paragraph 2

Although neural-network–based approaches to text classification have been quite effective (Kim,2014; Zhang et al., 2015; Johnson and Zhang, 2014;Tang et al., 2015), in this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture.

The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation.

尽管基于神经网络的文本分类方法已经非常有效(Kim，2014；Zhang 等，2015；Johnson 和 Zhang，2014；Tang 等，2015)，但在本文中，我们检验了这样一个假设：通过将文档结构的知识融入模型架构，可以获得更好的表示。

我们模型所基于的直觉是,并非文档的所有部分对于回答查询都具有同等的相关性,并且确定相关部分涉及对单词的交互进行建模,而不仅仅是对它们的单独存在进行建模。

paragraph 3

Our primary contribution is a new neural architecture (§2), the Hierarchical Attention Network(HAN) that is designed to capture two basic insights about document structure.

First, since documents have a hierarchical structure (words form sentences,sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation.

Second, it is observed that different words and sentences in a document are differentially informative.

Moreover, the importance of words and sentences is highly context dependent, i.e. the same word or sentence may be differentially important in different contexts (§3.5).

我们的主要贡献是新的神经体系结构(§2),即分层注意网络(HAN),其目的是捕获有关文档结构的两个基本见解。

首先,由于文档具有层次结构(单词构成句子,句子构成文档),我们同样地通过首先构建句子的表示形式,然后将其汇总为文档表示形式来构造文档表示形式。

其次，可以观察到文档中不同的单词和句子所包含的信息量是不同的。

而且,单词和句子的重要性在很大程度上取决于上下文,即,相同的单词或句子在不同的上下文中可能具有不同的重要性(第3.5节)。

To include sensitivity to this fact, our model includes two levels of attention mechanisms (Bahdanau et al., 2014; Xu et al., 2015) — one at the word level and one at the sentence level — that let the model pay more or less attention to individual words and sentences when constructing the representation of the document.

To illustrate, consider the example in Fig. 1, which is a short Yelp review where the task is to predict the rating on a scale from 1–5.

为了体现对这一事实的敏感性，我们的模型包含两个级别的注意力机制(Bahdanau 等，2014；Xu 等，2015)——一个在单词级别，另一个在句子级别——使模型在构建文档表示时能够对单个单词和句子给予或多或少的关注。

为了说明这一点，请考虑图 1 中的示例，这是一条简短的 Yelp 评论，任务是预测 1 到 5 的评分。

Intuitively, the first and third sentence have stronger information in assisting the prediction of the rating; within these sentences, the word delicious, a-m-a-z-i-n-g contributes more in implying the positive attitude contained in this review.

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision which can be of value in applications and analysis (Shen et al., 2014; Gao et al., 2014).

从直觉上讲，第一句和第三句在帮助预测评分方面包含更强的信息；在这些句子中，delicious、a-m-a-z-i-n-g 这样的词在暗示本评论所包含的积极态度方面起了更大的作用。

注意力机制有两个好处：它不仅通常能带来更好的性能，还能洞察哪些词和句子对分类决策有贡献，这在应用和分析中可能很有价值(Shen 等人，2014；Gao 等人，2014)。

paragraph 4

The key difference to previous work is that our system uses context to discover when a sequence of tokens is relevant rather than simply filtering for (sequences of) tokens, taken out of context.

To evaluate the performance of our model in comparison to other common classification architectures, we look at six data sets (§3). Our model outperforms previous approaches by a significant margin.

与先前工作的主要区别在于，我们的系统使用上下文来发现一段词元序列何时是相关的，而不是简单地对脱离上下文的词元(序列)进行过滤。

为了评估模型与其他常见分类体系结构相比的性能,我们查看了六个数据集(第3节)。 我们的模型大大优于以前的方法。

2 Hierarchical Attention Networks

The overall architecture of the Hierarchical Attention Network (HAN) is shown in Fig. 2.

It consists of several parts:

  • a word sequence encoder,
  • a word-level attention layer,
  • a sentence encoder (and)
  • a sentence-level attention layer.

We describe the details of different components in the following sections.

分层注意力网络(HAN)的总体体系结构如图2所示。

它由几个部分组成:

  • 单词序列编码器,
  • 单词级别的注意力层,
  • 句子编码器(和)
  • 句子级别的注意力层。

我们将在以下各节中描述不同组件的详细信息。
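
As a preview of how these four components fit together, below is a minimal sketch of the architecture in PyTorch. This is an illustrative outline, not the authors' released implementation: the module and variable names (AttentionPool, HAN, etc.) are ours, batching, padding, and masking are omitted, and the dimensions simply follow the configuration reported later in §3.3 (200-d embeddings, 50-d GRUs).

```python
# Minimal HAN skeleton (illustrative sketch, not the authors' implementation).
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """One-layer MLP + context-vector attention (Eqs. 5-7 / 8-10)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                 # W, b
        self.context = nn.Parameter(torch.randn(dim))   # u_w or u_s

    def forward(self, h):                               # h: (steps, dim)
        u = torch.tanh(self.proj(h))                    # hidden representation
        alpha = torch.softmax(u @ self.context, dim=0)  # importance weights
        return (alpha.unsqueeze(-1) * h).sum(dim=0)     # weighted sum


class HAN(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_dim=200, gru_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, gru_dim, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * gru_dim)
        self.sent_gru = nn.GRU(2 * gru_dim, gru_dim, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * gru_dim)
        self.classifier = nn.Linear(2 * gru_dim, num_classes)

    def forward(self, doc):                        # doc: list of 1-D LongTensors, one per sentence
        sent_vecs = []
        for sent in doc:
            x = self.embed(sent).unsqueeze(0)      # (1, T, emb_dim)
            h, _ = self.word_gru(x)                # (1, T, 2*gru_dim)
            sent_vecs.append(self.word_attn(h.squeeze(0)))
        s = torch.stack(sent_vecs)                 # (L, 2*gru_dim)
        h, _ = self.sent_gru(s.unsqueeze(0))
        v = self.sent_attn(h.squeeze(0))           # document vector
        return self.classifier(v)                  # unnormalized class scores


# Example usage with random data (two sentences, five rating classes):
model = HAN(vocab_size=50000, num_classes=5)
doc = [torch.randint(0, 50000, (12,)), torch.randint(0, 50000, (8,))]
logits = model(doc)
```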

2.1 GRU-based sequence encoder

paragraph 1

The GRU (Bahdanau et al., 2014) uses a gating mechanism to track the state of sequences without using separate memory cells.

GRU(Bahdanau等人,2014)使用门控机制来跟踪序列状态,而无需使用单独的存储单元。

There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. They together control how information is updated to the state.
At time t, the GRU computes the new state as

$h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (1)$

门有两种类型：重置门 $r_t$ 和更新门 $z_t$。它们共同控制状态信息如何被更新。
在时间 t，GRU 计算新状态为
$h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (1)$

This is a linear interpolation between the previous state $h_{t−1}$ and the current new state $ \tilde{h_t} $ computed with new sequence information.
The gate $z_t$ decides how much past information is kept and how much new information is added. $z_t$ is updated as:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad (2)$

where $x_t$ is the sequence vector at time t.

这是在先前状态 $h_{t-1}$ 和使用新序列信息计算出的当前新状态 $\tilde{h}_t$ 之间的线性插值。
门 $z_t$ 决定保留多少过去的信息以及加入多少新的信息。$z_t$ 的更新方式为：
$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \quad (2)$
其中 $x_t$ 是时间 t 的序列向量。

The candidate state $\tilde{h}_t$ is computed in a way similar to a traditional recurrent neural network (RNN):
$\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h), \qquad (3)$

候选状态 $\tilde{h}_t$ 的计算方式类似于传统的递归神经网络(RNN)：
$\tilde{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h), \qquad (3)$

Here $r_t$ is the reset gate which controls how much the past state contributes to the candidate state. If $r_t$ is zero, then it forgets the previous state. The reset gate is updated as follows:
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r). \qquad (4)$

$r_t$ 是重置门，它控制过去状态对候选状态的贡献量。如果 $r_t$ 为零，则它会忘记先前的状态。重置门更新如下：
$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r). \qquad (4)$
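
To make Eqs. (1)–(4) concrete, here is a small NumPy sketch of a single GRU step. The weight names, shapes, and toy dimensions are assumptions made for the example only, not the authors' code.

```python
# One GRU step following Eqs. (1)-(4); a NumPy sketch for illustration.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """x_t: (d_in,), h_prev: (d_h,), p: dict of weights W_*, U_*, b_*."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])              # Eq. (2) update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])              # Eq. (4) reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + r_t * (p["U_h"] @ h_prev) + p["b_h"])  # Eq. (3) candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                               # Eq. (1) new state

# Toy dimensions: 200-d input (word embedding), 50-d hidden state.
d_in, d_h = 200, 50
rng = np.random.default_rng(0)
p = {f"{m}_{g}": rng.standard_normal((d_h, d_in if m == "W" else d_h)) * 0.01
     for m in ("W", "U") for g in ("z", "r", "h")}
p.update({f"b_{g}": np.zeros(d_h) for g in ("z", "r", "h")})

h = np.zeros(d_h)
for x in rng.standard_normal((10, d_in)):   # run over a toy sequence of 10 steps
    h = gru_step(x, h, p)
```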

2.2 Hierarchical Attention

paragraph 1

We focus on document-level classification in this work. Assume that a document has $L$ sentences $s_i$ and each sentence contains $T_i$ words. $w_{it}$ with $t \in [1,T]$ represents the words in the $i$-th sentence. The proposed model projects the raw document into a vector representation, on which we build a classifier to perform document classification. In the following, we will present how we build the document level vector progressively from word vectors by using the hierarchical structure.

在这项工作中，我们专注于文档级分类。假设一个文档有 $L$ 个句子 $s_i$，每个句子包含 $T_i$ 个单词。$w_{it}$ 表示第 $i$ 个句子中的单词，其中 $t \in [1,T]$。所提出的模型将原始文档投影为向量表示，我们在其上构建分类器来执行文档分类。在下面，我们将介绍如何利用层次结构从词向量逐步构建文档级向量。

paragraph 2

$\bf Word \ \ Encoder$ Given a sentence with words $w_{it}$, $t \in [1,T]$, we first embed the words to vectors through an embedding matrix $W_e$, $x_{it} = W_e w_{it}$. We use a bidirectional GRU (Bahdanau et al., 2014) to get annotations of words by summarizing information from both directions for words, and therefore incorporate the contextual information in the annotation.
The bidirectional GRU contains the forward GRU $\overrightarrow{f}$ which reads the sentence $s_i$ from $w_{i1}$ to $w_{iT}$ and a backward GRU $\overleftarrow{f}$ which reads from $w_{iT}$ to $w_{i1}$:

$x_{it} = W_e w_{it},\ t \in [1,T],$
$\overrightarrow{h}_{it} = \overrightarrow{GRU}(x_{it}),\ t \in [1,T],$
$\overleftarrow{h}_{it} = \overleftarrow{GRU}(x_{it}),\ t \in [T,1].$

$\bf Word \ \ Encoder$ 给定包含单词 $w_{it}$（$t \in [1,T]$）的句子，我们首先通过嵌入矩阵 $W_e$ 将单词嵌入为向量：$x_{it} = W_e w_{it}$。我们使用双向 GRU（Bahdanau 等人，2014）通过汇总来自两个方向的信息来获取单词的注释，从而将上下文信息融入注释中。
双向 GRU 包含前向 GRU $\overrightarrow{f}$，它将句子 $s_i$ 从 $w_{i1}$ 读到 $w_{iT}$；以及后向 GRU $\overleftarrow{f}$，它从 $w_{iT}$ 读到 $w_{i1}$：

$x_{it} = W_e w_{it},\ t \in [1,T],$
$\overrightarrow{h}_{it} = \overrightarrow{GRU}(x_{it}),\ t \in [1,T],$
$\overleftarrow{h}_{it} = \overleftarrow{GRU}(x_{it}),\ t \in [T,1].$

We obtain an annotation for a given word $w_{it}$ by concatenating the forward hidden state $\overrightarrow{h}_{it}$ and backward hidden state $\overleftarrow{h}_{it}$, i.e., $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$, which summarizes the information of the whole sentence centered around $w_{it}$.

我们通过将前向隐藏状态 $\overrightarrow{h}_{it}$ 和后向隐藏状态 $\overleftarrow{h}_{it}$ 进行拼接，即 $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$，来获得给定单词 $w_{it}$ 的注释，它总结了以 $w_{it}$ 为中心的整个句子的信息。
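
A brief sketch of the word encoder described above, assuming PyTorch: `nn.Embedding` plays the role of $W_e$, and a bidirectional `nn.GRU` yields annotations whose last dimension already concatenates the forward and backward states. Sizes and variable names are illustrative only.

```python
# Word encoder sketch: bidirectional GRU annotations h_it = [forward; backward].
# Illustrative PyTorch code; dimensions follow the configuration in Section 3.3.
import torch
import torch.nn as nn

vocab_size, emb_dim, gru_dim = 50000, 200, 50
embed = nn.Embedding(vocab_size, emb_dim)                     # plays the role of W_e
word_gru = nn.GRU(emb_dim, gru_dim, bidirectional=True, batch_first=True)

words = torch.randint(0, vocab_size, (1, 15))                 # one sentence of 15 word ids
x = embed(words)                                              # x_it = W_e w_it, shape (1, 15, 200)
h, _ = word_gru(x)                                            # shape (1, 15, 100)
# The last dimension concatenates the forward (first 50) and backward (last 50)
# hidden states, i.e. h_it = [h_fwd_it ; h_bwd_it].
```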

paragraph 3

Note that we directly use word embeddings. For a more complete model we could use a GRU to get word vectors directly from characters, similarly to (Ling et al., 2015). We omitted this for simplicity.

请注意,我们直接使用单词嵌入。 对于更完整的模型,类似于(Ling等人,2015),我们可以使用GRU直接从字符中获取单词向量。 为了简单起见,我们省略了这一点。

paragraph 4

$\bf Word \ \ Attention$ Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce an attention mechanism to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector. Specifically,

$u_{it} = \tanh(W_w h_{it} + b_w) \qquad (5)$
$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \qquad (6)$
$s_i = \sum_t \alpha_{it} h_{it} \qquad (7)$

$\bf Word \ \ Attention$ 并非所有单词都对句子含义的表示有同等的贡献。因此，我们引入注意力机制来提取对句子含义重要的单词，并汇总这些信息性单词的表示以形成句子向量。具体来说，

$u_{it} = \tanh(W_w h_{it} + b_w) \qquad (5)$
$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \qquad (6)$
$s_i = \sum_t \alpha_{it} h_{it} \qquad (7)$

That is, we first feed the word annotation $h_{it}$ through a one-layer $\sf MLP$ to get $u_{it}$ as a hidden representation of $h_{it}$, then we measure the importance of the word as the similarity of $u_{it}$ with a word level context vector $u_w$ and get a normalized importance weight $\alpha_{it}$ through a softmax function.

也就是说，我们首先将单词注释 $h_{it}$ 输入一个单层 $\sf MLP$，得到 $u_{it}$ 作为 $h_{it}$ 的隐藏表示；然后用 $u_{it}$ 与词级上下文向量 $u_w$ 的相似度来衡量该单词的重要性，并通过 softmax 函数获得归一化的重要性权重 $\alpha_{it}$。

After that, we compute the sentence vector $ s_ i $ (we abuse the notation here) as a weighted sum of the word annotations based on the weights.

之后,我们根据权重计算句子向量$ s_i $(在这里我们滥用了表示法)作为单词注释的加权总和。

The context vector $ u_w $ can be seen as a high level representation of a fixed query “what is the informative word” over the words like that used in memory networks (Sukhbaatar et al., 2015; Kumar et al., 2015). The word context vector $ u_w $ is randomly initialized and jointly learned during the training process.

上下文向量 $u_w$ 可以看作是固定查询“什么是信息量大的单词”在这些单词之上的高级表示，类似于记忆网络(memory networks)中的用法(Sukhbaatar 等，2015；Kumar 等，2015)。词级上下文向量 $u_w$ 在训练过程中随机初始化并联合学习。
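
The word-attention step of Eqs. (5)–(7) reduces to a few matrix operations; the NumPy sketch below applies them to random placeholder annotations. All shapes and weight names are assumptions for the example.

```python
# Word attention sketch (Eqs. 5-7) in NumPy; shapes and names are illustrative.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

T, d = 15, 100                       # words per sentence, annotation dimension
rng = np.random.default_rng(0)
h_it = rng.standard_normal((T, d))   # word annotations from the bidirectional GRU
W_w = rng.standard_normal((d, d)) * 0.01
b_w = np.zeros(d)
u_w = rng.standard_normal(d) * 0.01  # word-level context vector (learned in practice)

u_it = np.tanh(h_it @ W_w.T + b_w)   # Eq. (5): one-layer MLP
alpha_it = softmax(u_it @ u_w)       # Eq. (6): normalized importance weights
s_i = alpha_it @ h_it                # Eq. (7): sentence vector as a weighted sum
```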

paragraph 5

$\bf Sentence \ \ Encoder$ Given the sentence vectors $s_i$, we can get a document vector in a similar way. We use a bidirectional GRU to encode the sentences:
$\overrightarrow{h_i} = \overrightarrow{GRU}(s_i), i \in [1,L],$
$\overleftarrow{h_i} = \overleftarrow{GRU}(s_i), i \in [L,1].$
We concatenate $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ to get an annotation of sentence $i$, i.e., $h_i$ = [ $\overrightarrow{h_i}, \overleftarrow{h_i}$]. $h_i$ summarizes the neighbor sentences around sentence $i$ but still focuses on sentence $i$.

$ \bf Sentence \ \ Encoder $ 给定句子向量$ s_i $,我们可以以类似的方式获得文档向量。 我们使用双向GRU对句子进行编码:
$\overrightarrow{h_i} = \overrightarrow{GRU}(s_i), i \in [1,L],$
$\overleftarrow{h_i} = \overleftarrow{GRU}(s_i), i \in [L,1].$
我们将 $\overrightarrow{h_i}$ 和 $\overleftarrow{h_i}$ 连接起来，得到句子 $i$ 的注释，即 $h_i$ = [ $\overrightarrow{h_i}, \overleftarrow{h_i}$ ]。$h_i$ 总结了句子 $i$ 周围的相邻句子，但仍以句子 $i$ 为中心。

paragraph 6

$\bf Sentence \ \ Attention$ To reward sentences that are clues to correctly classify a document, we again use an attention mechanism and introduce a sentence level context vector $u_s$ and use the vector to measure the importance of the sentences. This yields

$u_i = \tanh(W_s h_i + b_s), \qquad (8)$
$\alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)}, \qquad (9)$
$v = \sum\limits_i \alpha_i h_i, \qquad (10)$

$\bf Sentence \ \ Attention$ 为了奖励那些为正确分类文档提供线索的句子，我们再次使用注意力机制，并引入句子级上下文向量 $u_s$，用该向量来衡量句子的重要性。这产生

$u_i = \tanh(W_s h_i + b_s), \qquad (8)$
$\alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)}, \qquad (9)$
$v = \sum\limits_i \alpha_i h_i, \qquad (10)$

where $v$ is the document vector that summarizes all the information of sentences in a document. Similarly, the sentence level context vector can be randomly initialized and jointly learned during the training process.

其中$ v $是总结文档中句子所有信息的文档向量。 类似地,可以在训练过程中随机初始化句子级别的上下文向量并共同学习。
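
Sentence-level attention mirrors the word-level case; below is a corresponding NumPy sketch of Eqs. (8)–(10), again on placeholder data with illustrative names.

```python
# Sentence attention sketch (Eqs. 8-10) in NumPy; it mirrors the word-level case.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

L, d = 8, 100                        # sentences per document, annotation dimension
rng = np.random.default_rng(1)
h_i = rng.standard_normal((L, d))    # sentence annotations from the sentence-level bi-GRU
W_s = rng.standard_normal((d, d)) * 0.01
b_s = np.zeros(d)
u_s = rng.standard_normal(d) * 0.01  # sentence-level context vector (learned in practice)

u_i = np.tanh(h_i @ W_s.T + b_s)     # Eq. (8)
alpha_i = softmax(u_i @ u_s)         # Eq. (9)
v = alpha_i @ h_i                    # Eq. (10): document vector
```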

2.3 Document Classification

paragraph 1

The document vector $v$ is a high level representation of the document and can be used as features for document classification:

$p = \mathrm{softmax}(W_c v + b_c). \qquad (11)$

We use the negative log likelihood of the correct labels as training loss:

$L = -\sum\limits_d \log p_{dj}, \qquad (12)$

where $j$ is the label of document $d$.

文档向量 $v$ 是文档的高级表示，可以用作文档分类的特征：
$p = \mathrm{softmax}(W_c v + b_c). \qquad (11)$
我们使用正确标签的负对数似然作为训练损失：
$L = -\sum\limits_d \log p_{dj}, \qquad (12)$
其中 $j$ 是文档 $d$ 的标签。
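
Eqs. (11)–(12) amount to a softmax layer plus a negative log likelihood; a NumPy sketch for a single document follows (names, sizes, and the label value are illustrative).

```python
# Document classification sketch (Eqs. 11-12) in NumPy; names are illustrative.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

d, num_classes = 100, 5
rng = np.random.default_rng(2)
v = rng.standard_normal(d)                    # document vector from Eq. (10)
W_c = rng.standard_normal((num_classes, d)) * 0.01
b_c = np.zeros(num_classes)

p = softmax(W_c @ v + b_c)                    # Eq. (11): class probabilities
j = 3                                         # gold label of this document (placeholder)
loss = -np.log(p[j])                          # Eq. (12): this document's term of the sum over d
```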

3 Experiments

3.1 Data sets

paragraph 1

We evaluate the effectiveness of our model on six large scale document classification data sets. These data sets can be categorized into two types of document classification tasks: sentiment estimation and topic classification. The statistics of the data sets are summarized in Table 1. We use 80% of the data for training, 10% for validation, and the remaining 10% for test, unless stated otherwise.

我们在六个大型文档分类数据集上评估了模型的有效性。 这些数据集可以分为两种类型的文档分类任务:情感估计和主题分类。 数据集的统计信息汇总在表1中。除非另有说明,否则我们将80%的数据用于训练,10%的数据用于验证,将其余10%的数据用于测试。

$\bf Yelp \ \ reviews$ are obtained from the Yelp Dataset Challenge in 2013, 2014 and 2015 (Tang et al., 2015). There are five levels of ratings from 1 to 5 (higher is better).

$\bf Yelp \ \ reviews$ 来自 2013、2014 和 2015 年的 Yelp 数据集挑战赛(Tang et al., 2015)。评分分为从 1 到 5 的五个等级(越高越好)。

$\bf IMDB \ \ reviews$ are obtained from (Diao et al., 2014). The ratings range from 1 to 10.

$\bf IMDB \ \ reviews$来自(Diao等人,2014)。 评分范围是1到10。

$\bf Yahoo \ \ answers$ are obtained from (Zhang et al., 2015). This is a topic classification task with 10 classes: Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships and Politics & Government. The document we use includes question titles, question contexts and best answers. There are 140,000 training samples and 5000 testing samples. The original data set does not provide validation samples. We randomly select 10% of the training samples as validation.

$\bf Yahoo \ \ answers$ 来自 (Zhang et al., 2015)。这是一个包含 10 个类别的主题分类任务：社会与文化、科学与数学、健康、教育与参考、计算机与互联网、体育、商业与金融、娱乐与音乐、家庭与人际关系以及政治与政府。我们使用的文档包括问题标题、问题上下文和最佳答案。共有 140,000 个训练样本和 5,000 个测试样本。原始数据集不提供验证样本，我们随机选择 10% 的训练样本作为验证集。

$\bf Amazon \ \ reviews$ are obtained from (Zhang et al., 2015). The ratings are from 1 to 5. 3,000,000 reviews are used for training and 650,000 reviews for testing. Similarly, we use 10% of the training samples as validation.

$\bf Amazon \ \ reviews$ 来自 (Zhang et al., 2015)。评分为 1 到 5。3,000,000 条评论用于训练，650,000 条评论用于测试。同样，我们使用 10% 的训练样本作为验证集。

3.2 Baselines

paragraph 1

We compare HAN with several baseline methods, including traditional approaches such as linear methods, SVMs and paragraph embeddings using neural networks, LSTMs, word-based CNN, character-based CNN, and Conv-GRNN, LSTM-GRNN. These baseline methods and results are reported in (Zhang et al., 2015; Tang et al., 2015).

我们将 HAN 与几种基线方法进行了比较，包括线性方法、SVM、使用神经网络的段落嵌入等传统方法，以及 LSTM、基于词的 CNN、基于字符的 CNN、Conv-GRNN 和 LSTM-GRNN。这些基线方法和结果在 (Zhang 等人，2015；Tang 等人，2015) 中报告。

3.2.1 Linear methods

paragraph 1

Linear methods (Zhang et al., 2015) use the constructed statistics as features. A linear classifier based on multinomial logistic regression is used to classify the documents using the features.

BOW and BOW+TFIDF The 50,000 most frequent words from the training set are selected and the count of each word is used as features. BOW+TFIDF, as implied by the name, uses the TFIDF of counts as features.

n-grams and n-grams+TFIDF used the most frequent 500,000 n-grams (up to 5-grams).

Bag-of-means The average word2vec embedding (Mikolov et al., 2013) is used as feature set.
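
To make the linear baselines concrete, here is a rough scikit-learn sketch of BOW and BOW+TFIDF features (capped at the 50,000 most frequent words) fed to a logistic regression classifier. It is not the exact pipeline of Zhang et al. (2015); `train_texts` and `train_labels` are placeholder variables.

```python
# Illustrative BOW / BOW+TFIDF baselines with logistic regression.
# Not the exact setup of Zhang et al. (2015); shown only to make the idea concrete.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["the food was great", "terrible service and cold pasta"]  # placeholders
train_labels = [5, 1]

bow_model = make_pipeline(
    CountVectorizer(max_features=50000),          # counts of the 50k most frequent words
    LogisticRegression(max_iter=1000),            # linear classifier on the counts
)
tfidf_model = make_pipeline(
    TfidfVectorizer(max_features=50000),          # TF-IDF-weighted counts
    LogisticRegression(max_iter=1000),
)
bow_model.fit(train_texts, train_labels)
tfidf_model.fit(train_texts, train_labels)
print(bow_model.predict(["great pasta"]))
```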

3.2.2 SVMs

paragraph 1

SVM-based methods are reported in (Tang et al., 2015), including SVM+Unigrams, Bigrams, Text Features, AverageSG, SSWE. In detail, Unigrams and Bigrams use bag-of-unigrams and bag-of-bigrams as features respectively.
(Tang et al., 2015) 报告了基于 SVM 的方法，包括 SVM+Unigrams、Bigrams、Text Features、AverageSG、SSWE。具体来说，Unigrams 和 Bigrams 分别使用 bag-of-unigrams 和 bag-of-bigrams 作为特征。

Text Features are constructed according to (Kiritchenko et al., 2014), including word and character n-grams, sentiment lexicon features, etc.
Text Features 根据 (Kiritchenko et al., 2014) 构建，包括单词和字符 n-gram、情感词典特征等。

AverageSG constructs 200-dimensional word vectors using word2vec and the average word embeddings of each document are used.
AverageSG 使用 word2vec 构建 200 维词向量,并使用每个文档的平均词嵌入。

SSWE uses sentiment specific word embeddings according to (Tang et al., 2014).
SSWE 根据 (Tang et al., 2014) 使用特定于情感的词嵌入。

3.2.3 Neural Network methods

paragraph 1

The neural network based methods are reported in (Tang et al., 2015) and (Zhang et al., 2015).
基于神经网络的方法在 (Tang et al., 2015) 和 (Zhang et al., 2015) 中有报道。

CNN-word Word-based CNN models like that of (Kim, 2014) are used.
CNN-word 使用类似 (Kim, 2014) 的基于词的 CNN 模型。

CNN-char Character level CNN models are reported in (Zhang et al., 2015).
CNN-char 字符级 CNN 模型在 (Zhang et al., 2015) 中有报道。

LSTM takes the whole document as a single sequence and the average of the hidden states of all words is used as feature for classification.

LSTM 将整个文档作为单个序列,所有单词的隐藏状态的平均值作为分类特征。

Conv-GRNN and LSTM-GRNN were proposed by (Tang et al., 2015). They also explore the hierarchical structure: a CNN or LSTM provides a sentence vector, and then a gated recurrent neural network (GRNN) combines the sentence vectors into a document level vector representation for classification.
Conv-GRNN 和 LSTM-GRNN 由 (Tang et al., 2015) 提出。它们同样利用了层次结构：先用 CNN 或 LSTM 得到句子向量，再用门控循环神经网络 (GRNN) 将句子向量组合成文档级向量表示用于分类。

3.3 Model configuration and training

paragraph 1

We split documents into sentences and tokenize each sentence using Stanford’s CoreNLP (Manning et al., 2014). We only retain words appearing more than 5 times in building the vocabulary and replace the words that appear 5 times with a special UNK token. We obtain the word embedding by training an unsupervised word2vec (Mikolov et al., 2013) model on the training and validation splits and then use the word embedding to initialize We.
我们使用斯坦福的 CoreNLP (Manning et al., 2014) 将文档分成句子并标记每个句子。 我们在构建词汇表时只保留出现 5 次以上的词,并用特殊的 UNK 标记替换出现 5 次的词。 我们通过在训练和验证分割上训练一个无监督的 word2vec (Mikolov et al., 2013) 模型来获得词嵌入,然后使用词嵌入来初始化 We。
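
A minimal sketch of this preprocessing, with naive string splitting standing in for Stanford CoreNLP and a tiny placeholder corpus; only the more-than-5-occurrences vocabulary rule is taken from the text above, everything else is an assumption for the example.

```python
# Vocabulary building sketch: keep words seen more than 5 times, map the rest to UNK.
# Naive whitespace/period splitting stands in for Stanford CoreNLP here.
from collections import Counter

def split_doc(text):
    return [s.split() for s in text.lower().split(".") if s.strip()]

train_docs = ["The food was great. Service was great too.",
              "Terrible pasta. I do not recommend it."]          # placeholder corpus

counts = Counter(w for doc in train_docs for sent in split_doc(doc) for w in sent)
vocab = {"UNK": 0}
for word, c in counts.items():
    if c > 5:                          # retain words appearing more than 5 times
        vocab[word] = len(vocab)       # (on this toy corpus everything maps to UNK)

def encode(doc):
    """Document -> list of sentences -> list of word ids (unknown words become UNK)."""
    return [[vocab.get(w, vocab["UNK"]) for w in sent] for sent in split_doc(doc)]

encoded = [encode(doc) for doc in train_docs]
```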

paragraph 2

The hyper parameters of the models are tuned on the validation set. In our experiments, we set the word embedding dimension to be 200 and the GRU dimension to be 50. In this case a combination of forward and backward GRU gives us 100 dimensions for word/sentence annotation. The word/sentence context vectors also have a dimension of 100, initialized at random.
模型的超参数在验证集上进行了调整。 在我们的实验中,我们将词嵌入维度设置为 200,将 GRU 维度设置为 50。在这种情况下,前向和后向 GRU 的组合为我们提供了 100 个维度的词/句子注释。 词/句子上下文向量也有 100 维,随机初始化。

paragraph 3

For training, we use a mini-batch size of 64 and documents of similar length (in terms of the number of sentences in the documents) are organized to be a batch. We find that length-adjustment can accelerate training by three times. We use stochastic gradient descent to train all models with momentum of 0.9. We pick the best learning rate using grid search on the validation set.
对于训练,我们使用 64 的 mini-batch 大小,并将长度相似(就文档中的句子数而言)的文档组织为一个批次。 我们发现长度调整可以将训练速度提高三倍。 我们使用随机梯度下降来训练动量为 0.9 的所有模型。 我们在验证集上使用网格搜索选择最佳学习率。
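
A sketch of the length-based batching described here (the function and variable names are ours; padding and the training loop are omitted):

```python
# Group documents of similar length (number of sentences) into mini-batches of 64.
# Illustrative sketch; padding and per-batch tensor construction are omitted.
import random

def make_batches(encoded_docs, labels, batch_size=64):
    order = sorted(range(len(encoded_docs)), key=lambda i: len(encoded_docs[i]))
    batches = [([encoded_docs[i] for i in order[k:k + batch_size]],
                [labels[i] for i in order[k:k + batch_size]])
               for k in range(0, len(order), batch_size)]
    random.shuffle(batches)            # batches stay homogeneous in length, their order varies
    return batches
```

Training would then iterate over these batches with stochastic gradient descent and momentum 0.9 (e.g. `torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)` in a PyTorch setting), with the learning rate chosen by grid search on the validation set.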

3.4 Results and analysis

paragraph 1

The experimental results on all data sets are shown in Table 2. We refer to our models as HN-{AVE, MAX, ATT}. Here HN stands for Hierarchical Network, AVE indicates averaging, MAX indicates max-pooling, and ATT indicates our proposed hierarchical attention model. Results show that HN-ATT gives the best performance across all data sets.
所有数据集的实验结果如表 2 所示。我们将我们的模型称为 HN-{AVE, MAX, ATT}。这里 HN 代表分层网络，AVE 表示平均，MAX 表示最大池化，ATT 表示我们提出的分层注意力模型。结果表明，HN-ATT 在所有数据集上都取得了最佳性能。

paragraph 2

The improvement is regardless of data sizes. For smaller data sets such as Yelp 2013 and IMDB, our model outperforms the previous best baseline methods by 3.1% and 4.1% respectively. This finding is consistent across other larger data sets. Our model outperforms previous best models by 3.2%, 3.4%, 4.6% and 6.0% on Yelp 2014, Yelp 2015, Yahoo Answers and Amazon Reviews. The improvement also occurs regardless of the type of task: sentiment classification, which includes Yelp 2013-2014, IMDB, Amazon Reviews and topic classification for Yahoo Answers.
改进与数据大小无关。 对于 Yelp 2013 和 IMDB 等较小的数据集,我们的模型分别比之前的最佳基线方法高 3.1% 和 4.1%。 这一发现在其他更大的数据集上是一致的。 我们的模型在 Yelp 2014、Yelp 2015、Yahoo Answers 和 Amazon Reviews 上的表现分别比之前的最佳模型高 3.2%、3.4%、4.6% 和 6.0%。 无论任务类型如何,都会发生改进:情感分类,包括 Yelp 2013-2014、IMDB、亚马逊评论和雅虎问答的主题分类。

paragraph 3

From Table 2 we can see that neural network based methods that do not explore hierarchical document structure, such as LSTM, CNN-word and CNN-char, have little advantage over traditional methods for large scale (in terms of document size) text classification. E.g. SVM+TextFeatures gives performance 59.8, 61.8, 62.4, 40.5 for Yelp 2013, 2014, 2015 and IMDB respectively, while CNN-word has accuracy 59.7, 61.0, 61.5, 37.6 respectively.
从表 2 中我们可以看出，不利用分层文档结构的基于神经网络的方法(如 LSTM、CNN-word、CNN-char)在大规模(就文档规模而言)文本分类上与传统方法相比几乎没有优势。例如，SVM+TextFeatures 在 Yelp 2013、2014、2015 和 IMDB 上的性能分别为 59.8、61.8、62.4、40.5，而 CNN-word 的准确率分别为 59.7、61.0、61.5、37.6。

paragraph 4

Exploring the hierarchical structure only, as in HN-AVE and HN-MAX, can significantly improve over LSTM, CNN-word and CNN-char. For example, our HN-AVE outperforms CNN-word by 7.3%, 8.8%, 8.5% and 10.2% on Yelp 2013, 2014, 2015 and IMDB respectively. Our model HN-ATT, which further utilizes the attention mechanism combined with the hierarchical structure, improves over previous models (LSTM-GRNN) by 3.1%, 3.4%, 3.5% and 4.1% respectively. More interestingly, in the experiments, HN-AVE is equivalent to using non-informative global word/sentence context vectors (e.g., if they are all-zero vectors, then the attention weights in Eq. 6 and 9 become uniform weights). Compared to HN-AVE, the HN-ATT model gives superior performance across the board. This clearly demonstrates the effectiveness of the proposed global word and sentence importance vectors for the HAN.

3.5 Context dependent attention weights

上下文相关的注意力权重

paragraph 1

If words were inherently important or not important, models without an attention mechanism might work well, since the model could automatically assign low weights to irrelevant words and vice versa. However, the importance of words is highly context dependent. For example, the word good may appear in a review that has the lowest rating, either because users are only happy with part of the product/service or because they use it in a negation, such as not good. To verify that our model can capture context dependent word importance, we plot the distribution of the attention weights of the words good and bad from the test split of the Yelp 2013 data set, as shown in Figure 3(a) and Figure 4(a). We can see that the attention weight assigned to a word ranges from 0 to 1. This indicates that our model captures diverse contexts and assigns context-dependent weights to the words.

如果单词本质上重要或不重要，那么没有注意力机制的模型也可能表现很好，因为模型可以自动为不相关的单词分配较低的权重，反之亦然。然而，单词的重要性高度依赖于上下文。例如，单词 good 可能出现在评分最低的评论中，因为用户只对产品/服务的一部分感到满意，或者因为他们在否定中使用它，例如 not good。为了验证我们的模型可以捕获上下文相关的单词重要性，我们绘制了 Yelp 2013 数据集测试集中单词 good 和 bad 的注意力权重分布，如图 3(a) 和图 4(a) 所示。我们可以看到，分配给同一个单词的注意力权重分布在 0 到 1 之间。这表明我们的模型捕获了不同的上下文，并为单词分配了与上下文相关的权重。

paragraph 2

For further illustration, we plot the distribution when conditioned on the ratings of the review. Subfigures 3(b)-(f) in Figure 3 and Figure 4 correspond to the rating 1-5 respectively. In particular, Figure 3(b) shows that the weight of good concentrates on the low end in the reviews with rating 1. As the rating increases, so does the weight distribution. This means that the word good plays a more important role for reviews with higher ratings. We can observe the converse trend in Figure 4 for the word bad. This confirms that our model can capture the context-dependent word importance.

为了进一步说明，我们绘制了以评论评分为条件的分布。图 3 和图 4 中的子图 (b)-(f) 分别对应评分 1-5。特别地，图 3(b) 显示，在评分为 1 的评论中，单词 good 的权重集中在低端；随着评分的升高，权重分布也随之升高。这意味着单词 good 在评分较高的评论中扮演着更重要的角色。我们可以在图 4 中观察到单词 bad 的相反趋势。这证实了我们的模型可以捕获上下文相关的单词重要性。
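
One way such a distribution could be plotted, assuming per-occurrence attention weights have already been collected from the trained model on the test split (the `weights_for_word` values below are random placeholders, not the paper's measurements):

```python
# Plot the distribution of attention weights assigned to one word (e.g. "good").
# The weights below are random placeholders standing in for values collected
# from the trained model on the Yelp 2013 test split.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
weights_for_word = rng.beta(2, 5, size=1000)     # placeholder attention weights in [0, 1]

plt.hist(weights_for_word, bins=20, range=(0.0, 1.0), density=True)
plt.xlabel("attention weight")
plt.ylabel("density")
plt.title('Attention weight distribution for the word "good" (placeholder data)')
plt.show()
```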

3.6 Visualization of attention

注意力的可视化

paragraph 1

In order to validate that our model is able to select informative sentences and words in a document, we visualize the hierarchical attention layers in Figures 5 and 6 for several documents from the Yelp 2013 and Yahoo Answers data sets.
为了验证我们的模型能够选出文档中信息量大的句子和单词，我们在图 5 和图 6 中对来自 Yelp 2013 和 Yahoo Answers 数据集的若干文档的分层注意力层进行了可视化。

paragraph 2

Every line is a sentence (sometimes sentences spill over several lines due to their length). Red denotes the sentence weight and blue denotes the word weight. Due to the hierarchical structure, we normalize the word weight by the sentence weight to make sure that only important words in important sentences are emphasized. For visualization purposes we display $\sqrt{p_s} p_w$. The $\sqrt{p_s}$ term displays the important words in unimportant sentences to ensure that they are not totally invisible.

每一行都是一个句子(有时由于长度原因，一个句子会跨越几行)。红色表示句子权重，蓝色表示词权重。由于层次结构的存在，我们用句子权重对词权重进行归一化，以确保只强调重要句子中的重要词。出于可视化目的，我们显示 $\sqrt{p_s} p_w$。$\sqrt{p_s}$ 这一项使不重要句子中的重要词也能显示出来，以确保它们不会完全不可见。
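
A small sketch of the $\sqrt{p_s}\,p_w$ display rule described above, assuming the sentence weights $p_s$ and per-word weights $p_w$ have already been extracted from the model (the numbers are placeholders):

```python
# Compute the displayed word intensity sqrt(p_s) * p_w for visualization.
# p_s: one attention weight per sentence; p_w: per-word weights within each sentence.
import numpy as np

p_s = np.array([0.6, 0.1, 0.3])                          # sentence-level weights (placeholder)
p_w = [np.array([0.1, 0.7, 0.2]),                        # word-level weights per sentence
       np.array([0.5, 0.5]),
       np.array([0.2, 0.2, 0.6])]

display = [np.sqrt(ps) * pw for ps, pw in zip(p_s, p_w)]
# The square root dampens the sentence weight so that important words in
# unimportant sentences are not rendered completely invisible.
```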

paragraph 3

Figure 5 shows that our model can select the words carrying strong sentiment like delicious, amazing, terrible and their corresponding sentences. Sentences containing many words like cocktails, pasta, entree are disregarded. Note that our model can not only select words carrying strong sentiment, it can also deal with complex across-sentence context. For example, there are sentences like i don't even like scallops in the first document of Fig. 5. If looking purely at the single sentence, we may think this is a negative comment. However, our model looks at the context of this sentence, figures out this is a positive review, and chooses to ignore this sentence.
图 5 显示，我们的模型可以选出带有强烈情感的词，如 delicious、amazing、terrible，以及它们对应的句子。而包含 cocktails、pasta、entree 等词的句子则被忽略。请注意，我们的模型不仅可以选出带有强烈情感的词，还可以处理复杂的跨句上下文。例如，图 5 的第一个文档中有 i don't even like scallops 这样的句子，如果单纯看这一个句子，我们可能会认为这是负面评论。然而，我们的模型结合这句话的上下文，判断出这是一条正面评论，并选择忽略这句话。

paragraph 4

Our hierarchical attention mechanism also works well for topic classification in the Yahoo Answers data set. For example, for the left document in Figure 6 with label 1, which denotes Science and Mathematics, our model accurately localizes the words zebra, stripes, camouflage, predator and their corresponding sentences. For the right document with label 4, which denotes Computers and Internet, our model focuses on web, searches, browsers and their corresponding sentences. Note that this happens in a multiclass setting, that is, detection happens before the selection of the topic!

我们的分层注意力机制同样适用于 Yahoo Answers 数据集中的主题分类。例如，对于图 6 中标签为 1(表示科学与数学)的左侧文档，我们的模型准确地定位了单词 zebra、stripes、camouflage、predator 及其对应的句子。对于标签为 4(表示计算机与互联网)的右侧文档，我们的模型聚焦于 web、searches、browsers 及其对应的句子。请注意，这是在多类别设置下发生的，也就是说，检测发生在主题选定之前！

4 Related Work

paragraph 1

Kim (2014) uses neural networks for text classification. The architecture is a direct application of CNNs, as used in computer vision (LeCun et al., 1998), albeit with NLP interpretations. Johnson and Zhang (2014) explore the case of directly using a high-dimensional one-hot vector as input. They find that it performs well. Unlike word level modelings, Zhang et al. (2015) apply a character-level CNN for text classification and achieve competitive results. Socher et al. (2013) use recursive neural networks for text classification. Tai et al. (2015) explore the structure of a sentence and use tree-structured LSTMs for classification. There are also some works that combine LSTM and CNN structure for sentence classification (Lai et al., 2015; Zhou et al., 2015). Tang et al. (2015) use hierarchical structure in sentiment classification. They first use a CNN or LSTM to get a sentence vector and then a bi-directional gated recurrent neural network to compose the sentence vectors to get a document vector. There are some other works that use hierarchical structure in sequence generation (Li et al., 2015) and language modeling (Lin et al., 2015).
Kim (2014) 使用神经网络进行文本分类，其架构是计算机视觉中 CNN (LeCun et al., 1998) 的直接应用，不过带有面向 NLP 的诠释。Johnson 和 Zhang (2014) 探讨了直接使用高维 one-hot 向量作为输入的情形，发现其表现良好。与词级建模不同，Zhang 等人 (2015) 将字符级 CNN 应用于文本分类并取得了有竞争力的结果。Socher 等人 (2013) 使用递归神经网络进行文本分类。Tai 等人 (2015) 利用句子的结构，使用树结构的 LSTM 进行分类。还有一些工作将 LSTM 和 CNN 结构结合起来进行句子分类 (Lai et al., 2015; Zhou et al., 2015)。Tang 等人 (2015) 在情感分类中使用层次结构：他们首先使用 CNN 或 LSTM 得到句子向量，然后使用双向门控循环神经网络将句子向量组合为文档向量。另外一些工作在序列生成 (Li et al., 2015) 和语言建模 (Lin et al., 2015) 中使用了层次结构。

paragraph 2

The attention mechanism was proposed by (Bahdanau et al., 2014) in machine translation. The encoder-decoder framework is used, and an attention mechanism is used to select the reference words in the original language for words in the foreign language before translation. Xu et al. (2015) use the attention mechanism in image caption generation to select the relevant image regions when generating words in the captions. Further uses of the attention mechanism include parsing (Vinyals et al., 2014), natural language question answering (Sukhbaatar et al., 2015; Kumar et al., 2015; Hermann et al., 2015), and image question answering (Yang et al., 2015). Unlike these works, we explore a hierarchical attention mechanism (to the best of our knowledge this is the first such instance).
注意力机制由 (Bahdanau et al., 2014) 在机器翻译中提出：在编码器-解码器框架下，翻译前使用注意力机制为外语中的词选择原始语言中的参考词。Xu 等人 (2015) 在图像描述生成中使用注意力机制，在生成描述中的词时选择相关的图像区域。注意力机制的进一步应用包括句法分析 (Vinyals et al., 2014)、自然语言问答 (Sukhbaatar et al., 2015; Kumar et al., 2015; Hermann et al., 2015) 以及图像问答 (Yang et al., 2015)。与这些工作不同，我们探索的是一种分层注意力机制(据我们所知，这是第一个这样的实例)。

5 Conclusion

paragraph 1

In this paper, we proposed hierarchical attention networks (HAN) for classifying documents. As a convenient side-effect, we obtained better visualization using the highly informative components of a document. Our model progressively builds a document vector by aggregating important words into sentence vectors and then aggregating important sentence vectors into document vectors. Experimental results demonstrate that our model performs significantly better than previous methods. Visualization of these attention layers illustrates that our model is effective in picking out important words and sentences.
在本文中,我们提出了用于文档分类的分层注意力网络(HAN)。 作为一个方便的副作用,我们使用文档的高信息组件获得了更好的可视化。 我们的模型通过将重要单词聚合为句子向量,然后将重要句子向量聚合为文档向量来逐步构建文档向量。 实验结果表明,我们的模型性能明显优于以前的方法。 这些注意力层的可视化说明我们的模型可以有效地挑选出重要的单词和句子。

Acknowledgments This work was supported by
Microsoft Research.

致谢 这项工作得到了微软研究院的支持。


References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Phil Blunsom, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 193– 202. ACM.
Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. 2014. Modeling interestingness with deep neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. arXiv preprint arXiv:1506.03340.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Springer.
Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Svetlana Kiritchenko, Xiaodan Zhu, and Saif M Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, pages 723– 762.
Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285.
Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 899–907.
Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135.
Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. 1998. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98– 105.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 101–110. ACM.
Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. EMNLP.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. arXiv preprint arXiv:1503.08895.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proc. ACL.
Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1555–1565.
Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.
Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.
Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94. Association for Computational Linguistics.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2015. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. arXiv preprint arXiv:1509.01626.
Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A c-lstm neural network for text classification. arXiv preprint arXiv:1511.08630.